Data prefetching in caches.

A method using CHloc (change-local) type information is used for data
prefetch (D-prefetch) decision making. This information is stored in
history tables H, there being one such table for each CP at, for
example, the buffer control element (BCE). For each line L, H[L]
indicates the information for L in H. Two different types of histories
may be kept at H:

(1) XI-invalidates - At each H[L], there is recorded whether L was
XI-invalidated without refetching.

(2) CHloc - At each H[L], there is also recorded local-change history,
i.e., whether L was stored into since the last fetch.

It is also possible to keep a global H at the storage control element
(SCE). In this case, the SCE maintains a table I recording, for each
line L, information I[L] recording whether L involved XI-invalidates
during the last accesses by a CP. Upon a cache miss to L from a
processor CPi, the SCE prefetches some of those lines that involved
XI-invalidates (indicated by I) into cache Ci, if missing there. The
management of table I is simple. When an XI-invalidate on L occurs,
e.g., upon a store or an EX fetch, the corresponding entry is set. When
L is accessed, e.g., upon D-fetch misses, without XI-invalidate, the
entry in I is reset. Another criterion for turning an I entry OFF is
when the line is fetched, e.g., on demand or upon prefetch.
DATA PREFETCHING IN CACHES



This invention generally relates to a method for 
data prefetching in caches according to the pre- 
amble of claim 1. 

High performance multi-processor (MP) computer systems are being
developed to increase throughput by performing in parallel those
operations which can run concurrently on separate processors. Such
high performance MP computer systems are characterized by multiple
central processors (CPs) operating independently and in parallel, but
occasionally communicating with one another or with a main storage
(MS) when data needs to be exchanged. The CPs and the MS have
input/output (I/O) ports which must be connected to exchange data.

In the type of MP system known as the tightly coupled multi-processor
system, in which each of the CPs has its own cache, there exist
coherence problems at various levels of the system. More specifically,
inconsistencies can occur between adjacent levels of a memory
hierarchy. The multiple caches could, for example, possess different
versions of the same data because one of the CPs has modified its
copy. It is therefore necessary for each processor's cache to know
what has happened to lines that may be in several caches at the same
time. In an MP system where there are many CPs sharing the same main
storage, each CP is required to obtain the most recently updated
version of data according to architecture specifications when an
access is issued. This requirement necessitates constant monitoring of
data consistency among caches.

A number of solutions have been proposed to the cache coherence
problem. Early solutions are described by C. K. Tang in "Cache System
Design in the Tightly Coupled Multiprocessor System", Proceedings of
the AFIPS (1976), and L. M. Censier and P. Feautrier in "A New
Solution to Coherence Problems in Multicache Systems", IEEE
Transactions on Computers, Dec. 1978, pp. 1112 to 1118. Censier et al.
describe a scheme allowing shared writable data to exist in multiple
caches which uses a centralized global access authorization table.
However, as the authors acknowledge in their Conclusion section, they
were not aware of similar approaches as described by Tang two years
earlier. While Tang proposed using copy directories of caches to
maintain status, Censier et al. proposed to tag each memory block with
similar status bits.

A typical approach to multi-processor (MP) cache coherence is as
follows. When a processor needs to modify (store into) a cache line,
it makes sure that copies of the line in remote caches are invalidated
first. This is achieved either by broadcasting the store signal to
remote processors (for instance, through a common bus connecting all
processors) or by requesting permission from a centralized storage
function (for instance, the storage control element (SCE) in IBM 3081
systems). The process of invalidating a cache line that may or may not
exist in remote processor caches is called cross-interrogate
invalidate (XI-invalidate). Various design techniques have been
proposed for the reduction of such XI-invalidate signals. For example,
in IBM 3081 systems, exclusivity (EX) states at processor caches are
used to record the information that the associated lines are not
resident in remote caches and do not require XI-invalidate activities
when stored into from the caches owning the exclusivity states.

One inherent overhead in conventional MP cache designs is the extra
misses due to XI-invalidates. That is, a processor access to its cache
may find the line missing, which would not have occurred if the line
had not been XI-invalidated by a remote processor before the access.
This problem becomes more serious when large caches are used with more
central processors (CPs). Simulation results indicate that such extra
misses are mostly on data lines (D-lines), as opposed to instruction
lines (I-lines). With large caches, miss ratios are rather
satisfactory in a uni-processor (UP) environment. To reduce the extra
misses due to remote stores, one approach is to prefetch D-lines that
are potentially invalidated by remote CPs.

It is therefore the object of the present invention to provide a
method for data prefetching in multi-processor caches based on store
information, thereby achieving a significant reduction in data misses
in multi-processors with large caches.

The solution is described in the characterizing part of claim 1.

According to the invention, a mechanism using history information is
used for data prefetch (D-prefetch) decision making. This information
is stored in history tables H, there being one such table for each CP
at, for example, the buffer control element (BCE). For each line L,
H[L] indicates the information for L in H. Two different types of
histories may be kept at H:

(1) XI-invalidates - At each H[L], there is recorded whether L was
XI-invalidated without prefetching.

(2) CHloc - At each H[L], there is also recorded change-local history,
i.e., whether L was stored into since the last fetch.

It is also possible to keep a global H at the storage control element
(SCE). In this case, the SCE maintains a table I recording, for each
line L, information I[L] recording whether L involved XI-invalidates
during the last accesses by a CP. Upon a cache miss to L from a
processor CPi, the SCE prefetches some of those lines that involved
XI-invalidates (indicated by I) into cache Ci, if missing there. The
management of table I is simple. When an XI-invalidate on L occurs,
e.g., upon a store or an EX fetch, the corresponding entry is set.
When L is accessed, e.g., upon D-fetch misses, without XI-invalidate,
the entry in I is reset. Another criterion for turning an I entry OFF
is when the line is fetched, e.g., on demand or upon prefetch.

The foregoing and other objects, aspects and 
advantages of the invention will be better under- 
stood from the following detailed description of an 
embodiment of the invention with reference to the 
drawings, in which: 

Fig. 1 is a block diagram of a multi-processor system in which the
present invention may be used;

Fig. 2 is a table showing the results on level one misses with UP and
2WMP configurations;

Fig. 3 is a block diagram showing an organization of an MP system in
which local history tables are maintained;

Fig. 4 is a table showing the results using local history tables Hi;

Fig. 5 is a block diagram showing an organization of an MP system in
which a global history table is maintained; and

Fig. 6 is a table showing the results using a global table H.

Referring now to the drawings, and more particularly to Figure 1,
there is illustrated in block diagram form a multi-processor (MP)
system of the type in which the invention may be used. The MP system
comprises four central processors (CP0, CP1, CP2, and CP3) 10, 11, 12,
and 13 in which each CP includes an instruction execution (IE) unit
14, 15, 16, and 17 and buffer control unit (BCE) 20, 21, 22, and 23,
respectively. Each IE unit includes hardware and microcode that issue
instructions that require the fetching and storing of operands in main
storage (MS) 50.

The IE units 14 to 17 begin a fetching or storing operation by issuing
a fetch or store command to their respective cache controls, BCEs 20
to 23, which include a processor store through (ST) cache with its
associated processor cache directory (PD) and all processor cache
controls which are exclusively used by their associated CPs 10 to 13.
The CP generally issues a fetch or store command for each doubleword
(DW) unit required by an operand. If the cache line containing the DW
is in the PD, which is a cache hit, the DW is fetched or stored in the
cache in accordance with the command. For an operand fetch hit in
cache, the storage access is completed without any need to go outside
of the BCE. Occasionally, the required DW is not in the cache, which
results in a cache miss. Before the IE fetch or store command can be
completed, the DW must be fetched from the main storage. To do this,
the BCE generates a corresponding fetch or store miss command which
requests the storage control element (SCE) 30 to obtain from main
storage 50 a line unit of data having the DW required by the IE unit.
The line unit will be located in the main storage 50 on a line
boundary, but the required DW will be the first DW in the fetched line
to be returned to the requesting BCE in order to keep the IE request
going before the completion of the missed line transfer.
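
The return order of a missed line can be illustrated with a minimal
sketch. The text only states that the required DW is returned first;
the 64-byte line size is taken from the simulations described later,
and the wrap-around order of the remaining DWs, as well as the function
names, are assumptions made purely for illustration.

```python
DW_BYTES = 8      # a doubleword is 8 bytes
LINE_BYTES = 64   # assumed line size (the later simulations use 64-byte lines)


def line_base(address):
    """Address of the line boundary containing the given byte address."""
    return address - (address % LINE_BYTES)


def dw_return_order(address):
    """Order in which the DWs of the missed line are sent to the BCE.

    The required DW comes first, as the text states. The wrap-around
    order of the remaining DWs is an assumption of this sketch.
    """
    dws_per_line = LINE_BYTES // DW_BYTES
    first = (address % LINE_BYTES) // DW_BYTES
    return [(first + i) % dws_per_line for i in range(dws_per_line)]


if __name__ == "__main__":
    addr = 0x1028
    print(hex(line_base(addr)), dw_return_order(addr))
```

Returning the required DW first lets the IE unit resume while the rest
of the line is still in transit.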
SCE 30 connects to the CPs 10 to 13 and main storage 50. Main storage
50 is comprised of a plurality of basic storage module (BSM)
controllers BSC0, BSC1, BSC2, and BSC3 (51, 52, 53, and 54,
respectively) in which each basic storage controller connects to two
BSMs 0 (60, 62, 64, and 66) and 1 (61, 63, 65, and 67). The four BSCs
51 to 54 are each connected to the SCE 30. In prior systems, the SCE
30 contains four copy directories (CDs) 31, 32, 33, and 34, each
containing an image of the contents of a corresponding processor cache
directory (PD) in one of the BCEs in a manner similar to that
described in U.S. Patent No. 4,394,731 to Flusche et al.

A doubleword wide bidirectional data bus is provided between each BSM
60 to 67 in main storage and corresponding SCE port, and from SCE
ports to I/O channel processor 40 and each of the corresponding CPs 10
to 13. Along with the data busses, there are also separate sets of
command busses for control and address signals. When a CP encounters a
cache miss for a DW access request, its BCE initiates a line access
request to main storage by sending a miss command to SCE 30, which
then reissues the command to a required BSM in main storage. In the
event of a BSM busy condition, SCE 30 will save the request in a
command queue and will reissue it at a later time when the required
BSM 60 to 67 becomes available. SCE 30 also sequences the main storage
commands in an orderly fashion so that all commands to a particular
BSM are issued in first-in, first-out (FIFO) order, except when a
cache conflict is found by its XI logic. During the normal sequence of
handling a main storage request, SCE 30 constantly monitors the status
of main storage, analyzes the interrogation results of protection key
and all cache directories, examines updated status of all pending
commands currently being held in SCE 30, and also looks for any new
BCE commands that may be waiting in BCE 20 to 23 to be received by SCE
30.

SCE 30 maintains a plurality of store stacks (SS0, SS1, SS2, and SS3)
35, 36, 37, and 38, each for holding of main storage store requests of
up to 16 DWs for a corresponding CP. SCE 30 keeps enough directory
information for the store stacks for the indication of main storage
addresses and validity. When a store stack risks overflow, SCE 30
sends a priority request to the associated BCE 20 to 23 to hold the
sending of more store requests until the BCE receives a later signal
from SCE 30 clearing the store stack full condition. Data in the store
stacks are updated to main storage with appropriate scheduling
maintaining the incoming order within each store stack. A line fetch
request from a CP is held by SCE 30 until the SCE makes sure that all
existing stores to the line in the store stacks have been sent to the
associated BSM 60 to 67.
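
The store stack behavior described above can be sketched as follows.
This is a simplified software model, not the hardware design: the
class and function names are hypothetical, and the "hold" signal is
modelled as a boolean returned when the stack becomes full.

```python
from collections import deque

STACK_CAPACITY = 16  # each store stack holds up to 16 DWs, per the text


class StoreStack:
    """Hypothetical model of one per-CP store stack in the SCE."""

    def __init__(self):
        # Entries are (line_address, dw_index, data) in arrival order.
        self.pending = deque()

    def is_full(self):
        return len(self.pending) >= STACK_CAPACITY

    def push(self, line_address, dw_index, data):
        """Queue a store; return True if the BCE should now be told to hold."""
        if self.is_full():
            raise OverflowError("BCE sent a store while told to hold")
        self.pending.append((line_address, dw_index, data))
        return self.is_full()

    def drain_one(self):
        """Send the oldest store to main storage (incoming order is kept)."""
        return self.pending.popleft()

    def has_store_to(self, line_address):
        return any(entry[0] == line_address for entry in self.pending)


def fetch_may_proceed(stacks, line_address):
    """A line fetch is held while any stack still has stores to that line."""
    return not any(s.has_store_to(line_address) for s in stacks)
```

Holding the fetch until the stacks are drained for that line ensures
the fetched line reflects all earlier stores.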

One inherent overhead in conventional MP cache designs is the extra
misses due to XI-invalidates. When a line is stored by one processor,
copies of the line need to be invalidated from remote caches at the
proper point. For a given cache design and a given workload, let
$m_k$ be the number of instructions per occurrence that a reference
does not find the line in local cache in a k-way MP configuration.
Hence $m_1$ may be considered as the uni-processor (UP) miss behavior
without MP effects. Let

$$\Delta_k(m) = \left( \frac{1}{m_k} - \frac{1}{m_1} \right)^{-1}$$

be the number of instructions per extra miss in a k-way MP (as
compared with a UP) system due to XI-invalidates. In a later
discussion, superscripts are used to denote the behavior for specific
reference types. For example, $m_k^{DF}$ denotes the number of
instructions per D-fetch cache miss in a k-way MP system. Experiments
have shown that such extra misses were mainly on D-lines. When cache
size grows or when more processors are added, such extra misses will
have a higher percentage impact on MIPS. For instance, in certain
environments, extra misses alone may cost over 4% of system
performance. Also, experimental data shows that almost all of such
extra cache misses are on data lines, since processors rarely store
into instruction lines.
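
As a worked illustration of the "instructions per extra miss" measure,
the sketch below assumes that the extra miss rate is the difference
between the per-instruction miss rates of the k-way MP and the UP
configurations, which follows from the definitions above. The function
name and the input values are hypothetical and are not measurements
from this document.

```python
def instructions_per_extra_miss(m_up, m_mp):
    """Instructions per extra miss, given instructions per miss.

    m_up: instructions per miss in the UP configuration.
    m_mp: instructions per miss in the k-way MP configuration.
    Assumes extra miss rate = 1/m_mp - 1/m_up (misses per instruction).
    """
    extra_rate = 1.0 / m_mp - 1.0 / m_up
    if extra_rate <= 0:
        raise ValueError("MP configuration shows no extra misses")
    return 1.0 / extra_rate


if __name__ == "__main__":
    # Hypothetical values: one miss per 100 instructions in UP,
    # one miss per 50 instructions in the MP configuration.
    print(instructions_per_extra_miss(100.0, 50.0))
```

A larger value means extra misses are rarer, so a prefetching scheme
that removes extra misses raises this figure.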

One way of reducing such overhead due to extra misses is data
prefetching. It is clear that the extra misses concerned result only
from XI-invalidates; therefore, prefetching may be carried out on
XI-invalidate histories. Compared with more general data prefetching
schemes, this approach has the following advantages:

(1) Data prefetching in general is not as effective as instruction
prefetching (I-prefetching). More general D-prefetching may result in
excessive burden to the control functions and to the memory traffic.
Performing D-prefetching only on XI-invalidate related activities will
cut down prefetching frequencies, with better prefetching behavior.

(2) Effective D-prefetching involves histories. XI related histories
are useful for MP system design and, therefore, XI histories provide
benefit not limited to D-prefetching.

Consider first history tables H provided for D-prefetch decision
making. There is one such table for each CP (e.g., at the BCE). As
will be described below, a global history table H can be kept at the
SCE. For each line L, H[L] is used to indicate the information for
line L in table H. When there is a table for each CP, Hi is used to
denote the history table for processor CPi. Two types of histories may
be kept at H:

(i) XI-invalidates - At each H[L], a record is kept as to whether L
was XI-invalidated without prefetching.

(ii) CHloc - At each H[L], a record is kept of the Change-Local
history, i.e., whether L was stored into since the last fetch.

In the following, evaluation results on different algorithms are
presented. Simulations were done using a two-way MP (2WMP) memory
reference trace. Only 512K processor cache memories with four-way
set-associativity and 64-byte lines were considered. The base MP
algorithm simulated was read only (RO) aggressive, but with
conditional exclusive (EX) D-fetches (i.e., D-fetch EX when the line
is not in any cache). With UP and 2WMP configurations, the results
shown in Figure 2 were obtained on cache misses (not counting
XI-activities), where the superscript IF indicates instruction
fetches, the superscript DF indicates data (operand) fetches, and the
superscript DS indicates data (operand) stores. Close to three fourths
of $\Delta_2(m)$ (= 131.7) was due to D-fetches, with close to one
fourth due to D-stores.

Figure 3 illustrates an organization in which the BCE 20i of each CP
10i maintains a local history table Hi, 70i, for data prefetching
purposes. All lines in the main storage (MS) are grouped into fixed
size blocks, with each block containing T consecutive lines. For each
line L, BL denotes the block covering line L. The invention will be
illustrated by first considering each Hi as an Invalidate History
Table. Each Hi is a bit-vector of fixed size. For each memory line L,
a bit entry Hi[L] is located via the line address. Initially, all bits
of Hi are reset to zero.

(a) The bit Hi[L] is set (turned on) when the BCE invalidates L from
its cache through a remote request.

The bits of Hi are turned off in the following situation:

(b) The bit Hi[L] is reset when the line L is fetched into the cache
of CPi as a D-line.

Upon a miss on a D-line L in the cache of CPi, the BCE carries out
prefetch decisions as follows:

(c) Each line L' in block BL (including L itself) will be fetched,
starting with line L, into the cache if the bit Hi[L'] is set and if
the line L' is not resident in the cache. In a preferred embodiment,
the block BL consists of the line sequentially preceding line L and
the next two lines sequentially following the line L.

In principle, the Invalidate History Table Hi monitors those D-lines
that are invalidated from the local cache and triggers prefetch
operations when a D-miss occurs.
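
Rules (a) to (c) can be sketched as follows. This is a minimal
software model under stated assumptions: the class and function names
are hypothetical, the hash is a simple modulo on the line number (the
text does not specify the hash), the table uses one byte per bit for
readability, and the block follows the preferred embodiment (the
preceding line and the next two lines). The demanded line L is always
fetched, since it is the line that missed.

```python
TABLE_BITS = 32 * 1024  # table size used in the experiments described here


class InvalidateHistory:
    """Hypothetical model of a local Invalidate History Table Hi."""

    def __init__(self, bits=TABLE_BITS):
        self.bits = bytearray(bits)  # one byte per bit, all reset to zero

    def _index(self, line):
        return line % len(self.bits)  # assumed hash: modulo on line number

    def on_remote_invalidate(self, line):  # rule (a)
        self.bits[self._index(line)] = 1

    def on_d_fetch(self, line):  # rule (b)
        self.bits[self._index(line)] = 0

    def is_set(self, line):
        return self.bits[self._index(line)] == 1


def block_of(line):
    """Lines of the block for the missed line, starting with the line itself."""
    neighbours = [line - 1, line + 1, line + 2]
    return [line] + [n for n in neighbours if n >= 0]


def lines_to_fetch(history, resident, line):
    """Rule (c): the demanded line plus set, non-resident lines of its block."""
    chosen = []
    for candidate in block_of(line):
        demanded = candidate == line
        wanted = history.is_set(candidate) and candidate not in resident
        if demanded or wanted:
            chosen.append(candidate)
    return chosen
```

After the fetch, the caller would apply rule (b) by calling
`on_d_fetch` for every line actually brought into the cache.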

In the experiments, a hash table for each Hi with 32K entries was
used. Each entry in the hash table was a single bit. If the number of
consecutive lines T = 4, the results shown in Figure 4 are obtained.
These results show a reduction of $\Delta_2(m)$ by 47.3% over the
standard MP design without data prefetching. Furthermore, among the
D-misses (once every 58.7 instructions), only 21.4% (once every 274.2
instructions) resulted in effective data prefetches (i.e., those with
at least one non-demanded line prefetched). Among the effective data
prefetches, 72.6% (93.2%, respectively) result in the prefetch of only
one line (up to two lines, respectively), with an average of 1.14
lines prefetched each time.

In the above experiment, if a single hash table H at, for example, the
SCE is used instead of one for each CP, results are obtained which are
very close to what is observed with local XI-invalidate tables. Figure
5 illustrates the organization for such a design. The operations for
D-prefetch with a global Invalidate History Table H, 71, are very
similar to the ones with local tables:

(d) The bit H[L] is set when the SCE issues an XI-invalidate of L to
any of the CPs due to a request from a CP (which itself may cause a
miss fetch).

(e) The bit H[L] is reset when the line L is fetched into a CP cache
as a D-line. This does not include the situation of a miss fetch
described in step (d) above.

(f) Upon a miss fetch of a D-line L from a CP, each line L' in block
BL (including L itself) will be fetched, starting with L, into the
requesting cache if the bit H[L'] is set and if L' is not resident
there.

In step (f) above, the SCE may filter out unnecessary prefetching (of
the lines that are already in the target CP cache) by examining the
corresponding copy directory. In certain designs in which the SCE does
not maintain residency information for local caches, the BCE may
simply send the SCE a tag, along with the D-miss request, indicating
its cache residency of those lines in the associated block BL.
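
Steps (d) to (f), together with the copy-directory filter, can be
sketched as follows. This is a simplified model, not the SCE design:
the class and method names are hypothetical, copy directories are
modelled as Python sets, the hash is a modulo on the line number, and
blocks are taken to be aligned groups of T consecutive lines. The miss
fetch in `d_miss` is assumed not to cause an XI-invalidate itself.

```python
class GlobalInvalidateHistory:
    """Hypothetical model of a global Invalidate History Table H at the SCE."""

    def __init__(self, n_cps, bits=32 * 1024, block_lines=4):
        self.bits = bytearray(bits)                      # one byte per bit
        self.copy_dirs = [set() for _ in range(n_cps)]   # per-CP residency
        self.block_lines = block_lines                   # T lines per block

    def _index(self, line):
        return line % len(self.bits)  # assumed hash

    def store_request(self, cp, line):
        """Step (d): set H[L] when a request XI-invalidates a remote copy."""
        invalidated = False
        for other, directory in enumerate(self.copy_dirs):
            if other != cp and line in directory:
                directory.discard(line)
                invalidated = True
        if invalidated:
            self.bits[self._index(line)] = 1

    def d_miss(self, cp, line):
        """Step (f): fetch L, then set and non-resident lines of its block."""
        start = line - line % self.block_lines
        fetched = [line]
        for candidate in range(start, start + self.block_lines):
            if candidate == line:
                continue
            is_set = self.bits[self._index(candidate)] == 1
            if is_set and candidate not in self.copy_dirs[cp]:
                fetched.append(candidate)
        for f in fetched:
            self.copy_dirs[cp].add(f)
            self.bits[self._index(f)] = 0  # step (e)
        return fetched
```

Checking the copy directory before prefetching avoids sending lines
that the requesting cache already holds.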

In MP cache designs, an XI-invalidate operation may be activated on an
anticipatory basis. For instance, in IBM 3081 and 3090 systems, a
D-fetch miss from a CP may trigger XI-invalidate of the line in a
remote cache, anticipating subsequent stores into the line after the
current D-fetch operation. It is possible for the history table H to
monitor only D-store activities (versus XI-invalidate activities). For
instance, for a global history table H described above, step (d) may
be replaced with step (d') as follows:

(d') The bit H[L] is set when the SCE receives a D-store of L (which
itself may cause a miss fetch) from any of the CPs.

In this case, H is simply a global Change-Local history table which
monitors those lines that are currently being actively modified.
Simulation results of D-prefetching with a global Change-Local history
table H are summarized in Figure 6. As the number of CPs grows, it can
be burdensome for the SCE to check the global Change-Local history
table H for each D-store from the CPs. This burden may be reduced
through various design techniques. For instance, as described in U.S.
Patent No. 4,394,731 to Flusche et al., exclusivity (EX) locking may
be used for modern MP cache coherence control. Each D-store is
required to obtain EX status on the line first, which will guarantee
that the line be XI-invalidated from remote caches. Hence, step (d')
may be implemented such that H[L] is set only upon the first D-store
to line L after the requesting CP obtains EX status on the line L. It
is unnecessary for the SCE to set the entry H[L] upon subsequent
D-stores since the entry is already set by earlier D-stores in normal
conditions. Note also that, since H is only used as a device for
assisting prefetch decisions, it is not necessary to precisely
maintain H according to strict rules in situations that may
complicate the design. With the number of consecutive lines T = 3
(T = 4, respectively), $\Delta_2(m)$ was reduced by 51.9% (61.4%,
respectively) with 207.56 (201.46, respectively) instructions per
effective prefetching, and with an average of 1.4 (1.7, respectively)
lines prefetched for each effective prefetch. Comparing these results
with those for local history tables, we find that the global history
table approach generally performs better. This is partly due to the
fact that a global directory can generally better capture the tendency
of the line being reused dynamically.
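
The EX-based variant of step (d') can be sketched as follows, assuming
a simplified model in which EX ownership and the "first store since EX
was obtained" condition are both tracked in one place. The class,
method, and field names are hypothetical, and the counter exists only
to show how many table updates the SCE performs.

```python
class ChangeLocalHistory:
    """Hypothetical model of a global Change-Local history table H."""

    def __init__(self, bits=32 * 1024):
        self.bits = bytearray(bits)    # one byte per bit
        self.ex_owner = {}             # line -> CP currently holding EX
        self.stored_since_ex = set()   # lines stored into under current EX
        self.table_updates = 0         # number of times H was written

    def _index(self, line):
        return line % len(self.bits)   # assumed hash

    def obtain_ex(self, cp, line):
        """Grant EX status; remote copies would be XI-invalidated here."""
        self.ex_owner[line] = cp
        self.stored_since_ex.discard(line)

    def d_store(self, cp, line):
        """Step (d'): set H[L] only on the first D-store after EX is obtained."""
        if self.ex_owner.get(line) != cp:
            self.obtain_ex(cp, line)
        if line not in self.stored_since_ex:
            self.bits[self._index(line)] = 1
            self.stored_since_ex.add(line)
            self.table_updates += 1

    def is_set(self, line):
        return self.bits[self._index(line)] == 1
```

With this arrangement, repeated stores by the same CP to a line it
holds EX touch the table only once, which is the reduction in SCE
burden that the paragraph above describes.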

From the results on a 2WMP system, extra cache misses due to
XI-invalidates were substantially reduced. A greater performance
benefit can be obtained with more processors in the MP system. One
important factor in the results is the relatively low frequencies
(more than 200 instructions per occurrence) of effective prefetch. As
a result, it is possible to prefetch more than one line each time.
Such multiple data prefetches are more appropriate in an environment
in which line fetch bandwidth is very wide (e.g., one line per cycle).
The prefetched lines may be moved into a stage buffer waiting for
cache putaways (and possibly cache directory lookups also). Even when
more CPs are involved, a drastically higher data prefetch rate is not
expected, since the prefetching is done only on cache misses.

Two mechanisms have been described for maintaining the histories for
D-prefetching, one with local XI-invalidate histories and the other
with global Change-Local history. Depending upon particular
implementations, these histories or other similar ones may be used in
various combinations. It is worth noticing that such a history itself
may serve the purpose for other kinds of cache optimizations. For
instance, the Change-Local histories (either local or global) may
provide information for the optimization of cache status assignments
in MP cache designs as described in co-pending patent application
Serial No. 07/232,722 filed on August 16, 1988, by Lishing Liu for "An
Improved Multiprocessor Cache Using a Local Change State" (IBM Docket
[illegible]).

Data prefetching in MPs is normally associated with the increase of XI
activities. However, the subject invention significantly improves XI
problems with proper coherence mechanisms.

While the invention has been described in terms of two preferred
embodiments, those skilled in the art will recognize that the
invention can be practiced with modification within the spirit and
scope of the appended claims.



Claims 

1. Method of data prefetching in caches of a multi-processor system
comprising a plurality of processors (CP), a shared main storage (MS)
and a storage control element (SCE), each of said processors having a
local cache memory and a buffer control element, characterized by
steps of:
providing a local history table (H) at each of said processors (CP),
said local history tables containing for each line (L) in cache memory
(C) a record (I[L]) as to whether the line was XI-invalidated without
prefetching and a record as to whether the line was stored into since
the last fetch;
setting an entry in a local history table (H) when a line is
XI-invalidated from a processor (CP) and resetting the entry when a
processor fetches a line (L) into its cache memory (C); and
when an entry for a line to be accessed by a processor (CP) is set,
prefetching from main storage (MS) a block of data containing a
predetermined number of consecutive lines of data including said line
to be accessed.

2. Method of data prefetching recited in claim 1, characterized in
that said main storage (MS) is divided by blocks, each block
containing said predetermined number of consecutive lines of data, and
the step of prefetching is performed by prefetching a block of data
from said main storage containing said line to be accessed.

3. Method of data prefetching recited in claim 2, characterized in
that said predetermined number of consecutive lines of data comprises
a line sequentially preceding said line (L) and two lines next
sequentially following said line to be accessed.

4. Method of data prefetching recited in claim 1, characterized in
that said step of providing local history tables is performed by
storing said local history tables (H) in the buffer control elements
(BCE) for each of said processors (CP).

5. Method of data prefetching as set forth in one of claims 1 - 4,
characterized by the step of setting an entry in said history table
(H) when a line is XI-invalidated from a processor (CP) or fetched
with exclusive status by a processor (CP) and resetting the entry when
a processor (CP) fetches a line into its cache memory.

6. Method of data prefetching as set forth in 
claim 1 or 5, characterized by 
the step of setting an entry in said history table 
when the storage control element (SCE) receives a 
data store for a line (L). 